Discrete Distribution Family

Multivariate discrete distribution

Yu Cheng Hsu

Intended learning outcomes

Define the properties of Categorical, Multinomial, and Dirichlet distributions.
Explain the use of the Dirichlet distribution as a conjugate prior and the concept of pseudo-counts.

Categorical distribution

If a random variable \(X\) has \(k\) distinct outcomes (\(k \geq 2\)), then \(X\) follows a Categorical distribution.

Support: \(x \in \{1,\dots, k\}\)
Parameters:
- \(p_1, p_2, \dots, p_k\) (noting that \(\sum_{i=1}^k p_i = 1\) and \(p_i \ge 0\))
pmf

You can view the categorical distribution as a multivariate version of the Bernoulli distribution.
There are only \(k-1\) free parameters since \(p_k = 1 - \sum_{i=1}^{k-1} p_i\).

Example

Phenotypes (e.g. A, B, O blood types)
Nucleotide bases (e.g. A, T, C, G)

Multinomial distribution

Similar to the univariate case, if we summarize the counts from repeated categorical trials, we obtain the Multinomial distribution.

Support: \(x_1,\dots,x_k \mid \sum_{i=1}^k x_i =n,\; x_i \in \mathbb{N}\)
Parameters:
- \(p_1, p_2, \dots, p_k\) (noting \(\sum_{i=1}^k p_i = 1\) and \(p_i \ge 0\))
- \(n\): number of trials
pmf

Again, there are only \(k-1\) free parameters since \(p_k = 1 - \sum_{i=1}^{k-1} p_i\).

Example

Summary of categorical data

Dirichlet distribution

The Dirichlet distribution is the conjugate prior of the Categorical and Multinomial distributions. It describes the belief about the probability vector \(p\) in these models.

Support: \(p_1,\dots, p_k\)
Parameters: \(\alpha_1,\dots, \alpha_k\)
pdf \[ \frac{1}{B(\mathbf{\alpha})}\prod_{i=1}^{k} p_i^{\alpha_i-1} \]

You can view the Dirichlet distribution as a multivariate version of the Beta distribution.

Bayesian estimation

Motivation: We tries to incorporate our knowledge and experience to estimate the parameters of a distribution.
From Bayes’ theorem, we can use the prior distribution to update our belief about the parameters after observing the data. In mathematical terms, we have

in short, in most of the calculation, we can write

Dirichlet as categorical distribution prior

Set up

\(X = x_1,\dots,x_N\) are \(N\) independent observations.
Drawn from a Categorical distribution with \(\textbf{p}=(p_1,\dots,p_K)\) and \(K\) classes.
Let \(n_k\) be the count of observations in category \(k\).

Prior

A Dirichlet prior \(P(\textbf{p} \mid \mathbf{\alpha}) = \text{Dir}(\textbf{p} \mid \mathbf{\alpha}) = \frac{1}{\text{B}(\mathbf{\alpha})} \prod_{i=1}^K p_i^{\alpha_i - 1}\)

From Bayes’ theorem, we have

For simplification, let \(c=(c_1,\dots,c_k)\) as the count of observations in each category, and the posterior distribution of the parameter is

The Posterior Predictive Distribution

Thep posterior distribution has the kernel (functional form) of a Dirichlet distribution with updated parameters \(\mathbf{\alpha}'\):

Interpretation: adding \(\alpha_i\) pseudo-counts to the observed counts \(c_i\) for each category \(i\).

Posterior predictive distribution

Intuitively speaking if we want to guess the probability of the next observation \(x_{N+1}\) being category \(j\) is still a Categorical distribution,

but we need to use the updated parameters \(\mathbf{\alpha}'\) to calculate the :

Illustration of Dirichlet Prior

Dirichlet prior adds a psuedo \(\alpha\) observations. (The observed data is [8,2,2])

Dirichlet as a multinomial distribution prior

Set up

\(X = x_1,\dots,x_n\) are observations drawn from a Multinomial distribution with \(P=(p_1,\dots,p_K)\) and \(K\) classes.

Prior

A Dirichlet prior \(P(\textbf{p} \mid \mathbf{\alpha}) = \text{Dir}(\textbf{p} \mid \mathbf{\alpha}) = \frac{1}{\text{B}(\mathbf{\alpha})} \prod_{i=1}^K p_i^{\alpha_i - 1}\)

The probability of observing this data is

Using Bayes’ Rule, the posterior is proportional to the Likelihood times the Prior: \[ P(\textbf{p} \mid \textbf{X}, \mathbf{\alpha}) \propto P(\textbf{X} \mid \textbf{p}) P(\textbf{p} \mid \mathbf{\alpha}) \]

Dropping the constants (terms not involving \(\textbf{p}\)) to find the kernel:

Illustration

Thep posterior distribution has the kernel (functional form) of a Dirichlet distribution with updated parameters \(\mathbf{\alpha}'\):

Interpretation: The \(\alpha_i\) values act as “imaginary” observations seen before the experiment.

Interpretation of Dirichlet prior

	Categorical	Multinomial
Prior	Dirichlet	Dirichlet
Posterior	\(\prod_{k=1}^Kp_k^{n_k+\alpha_k-1}\)	\(\prod_{k=1}^Kp_k^{x_k+\alpha_k-1}\)
Posterior predictive	Categorical	Dirichlet-multinomial

Interpretation:
- Adding a Dirichlet prior is equivalent to adding \(\alpha\) pseudo-observations to the data.
- As the number of observations increases, the influence of the prior decreases.

Application in bioinformatics

Motif discovery

Metagenomics

Gut microbiome composition of different types from Arumugam et al. (2011)

Connection with linear regression (Optional)

Generalized linear model

Limitations of ordinary linear regression

Count data (Multinomial/Poisson)
Dichotomous or Multinomial outcomes (Bernoulli)
Proportions (values between 0 and 1)

A generalized linear model (GLM) can model different kinds of data.

Components of GLM

Random components
- Identify \(Y\) and its distribution
Systematic components
- Linear predictor \(\alpha + \beta_1 x_1 + \dots + \beta_k x_k\)
Link function
- Given \(\mu=E(Y)\), the link function \(g(\cdot)\) satisfies \[g(\mu)=\alpha+\beta_1x_1+\dots+\beta_kx_k\]

Logistic regression

In logistic regression (binomial outcomes), the link function is

Logit link function
\[ \log\frac{p(x)}{1-p(x)}=\alpha+\beta x \]

Transform them back to get \(p(x)\)

\[ p(x)=\frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}} \]

For a dichotomous predictor \(x\) with values 0 and 1, \(\beta\) is the log odds ratio (\(\log(\text{OR})\)).

Discussion: if we model \(p(x)=\alpha+\beta x\), is it causing any trouble?

Potential estimate for \(p(x)>1\) or \(p(x)<0\)

Multinomial logistic regression

How do we compare multiple categories?
- 1 vs. n
- n vs. n
There are \(k(k-1)/2\) pairs of comparisons, but actually, you only need \(k-1\) non-redundant comparisons.

Let the category \(K\) as the baseline, for each comparison of \(k = 1,\dots,k-1\)

Baseline-category logit model are \[ \log(\frac{p_k(x)}{p_K(x)})=\alpha_k+\beta_kx \quad k=1,\dots,K-1 \]

Interpretation of \(\beta\)

The model assumes \(\ln(\frac{p(x)}{1-p(x)})=\alpha+\beta x\), so for a given change \(\Delta x\),

The odds at \(x + \Delta x\) is

\[ \frac{p(x + \Delta x)}{1-p(x + \Delta x)}=\exp(\alpha+\beta(x+\Delta x)) = \exp(\alpha+\beta(x))\exp(\beta\Delta x) \]

The odds ratio (OR) for \(x\) is

\[ \text{OR}=\frac{\text{odds at}(x+\Delta x)}{\text{odds at}(x)}=\frac{\exp(\alpha+\beta(x+\Delta x))}{\exp(\alpha+\beta(x))} = \exp(\beta\Delta x) \]

\(\beta\) is the change in the log odds when x increases by 1 unit.
Confidence interval of OR is not symmetric to the mean.

How about alpha ?

Connections with contingency tables

Logistic regression v.s. Contingency table

	Logistic regression	Contingency table
Estimating	OR	OR
Test on esstimation	t-test/\(\chi^2\) test	Fisher’s exact test/\(\chi^2\) test

Advantage of logistic regression
- Controlling other confounders
Estimating OR for continuous variables
Compatible with multinomial outcomes

Back to the kidney stone example

In clinical practice, for stone diameters less than 2cm, doctors usually conduct PCNL, while open surgery is used for stone diameters of 2cm or greater.

Stone size	Treatment	Success	Fail
Small	OS	81	6
Small	PCNL	234	36
Large	OS	192	71
Large	PCNL	55	25

Model: (Success/Fail) ~ Stone size + Treatment


-- Log Odds Ratios & CI w/o Control Stone Size --
              2.5%  97.5%  Odds Ratio
Intercept     1.28   1.83        1.56
Treatment_OS -0.66   0.08       -0.29

-- Log Odds Ratios & CI with Control Stone Size  --
                  2.5%  97.5%  Odds Ratio
Intercept         1.60   2.27        1.94
Stone_Size_Large -1.73  -0.79       -1.26
Treatment_OS     -0.09   0.81        0.36

Connections to continuous variables

Ordinal logistic regression

For ordinal data (e.g., a scale with scores “No”, “Neutral”, and “Yes”), the following rules apply:

\[ P(x \leq k)=p(x=1) + \dots + p(x=k), \text{where } p(x\leq K) = 1 \]

Intuitively, we can think that the regression is measuring the same thing, but the cutoff for making decisions is different. So, we only need to estimate the cumulative logits:

\[ \log(\frac{p(x\leq j)}{p(x > j)}) =\alpha_j+\beta x \]

Ordinal logistic regression

Illustration: \(\alpha_{-1}=-3,\alpha_{0}=2, \beta=1.5\)

Implications

Finding cut-offs for different ordinal outcomes
If we extend to really continuous case, then we do not need that cut-offs to specific class

Bibliography

Arumugam, Manimozhiyan, Jeroen Raes, Eric Pelletier, Denis Le Paslier, Takuji Yamada, Daniel R Mende, Gabriel R Fernandes, et al. 2011. “Enterotypes of the Human Gut Microbiome.” Nature 473 (7346): 174–80.

D’haeseleer, Patrik. 2006. “How Does DNA Sequence Motif Discovery Work?” Nature Biotechnology 24 (8): 959–61.